A fast and fully automated system for glaucoma detection using color fundus photographs

This paper presents a computationally lightweight and memory-efficient, convolutional neural network (CNN)-based, fully automated system for the detection of glaucoma, a leading cause of irreversible blindness worldwide. Using color fundus photographs, the system detects glaucoma in two steps. In the first step, the optic disc region is localized using the You Only Look Once (YOLO) CNN architecture. In the second step, the region is classified as ‘glaucomatous’ or ‘non-glaucomatous’ using the MobileNet architecture. A simplified version of the original YOLO net, specific to the context, is also proposed. Extensive experiments are conducted using seven state-of-the-art CNNs with varying computational intensity, namely MobileNetV2, MobileNetV3, Custom ResNet, InceptionV3, ResNet50, 18-Layer CNN and InceptionResNetV2. A total of 6671 fundus images collected from seven publicly available glaucoma datasets are used in the experiments. The system achieves an accuracy and F1 score of 97.4% and 97.3%, respectively, with sensitivity, specificity, and AUC of 97.5%, 97.2%, and 99.3%, respectively. These findings are comparable with the best reported methods in the literature. With comparable or better performance, the proposed system produces significantly faster decisions and drastically reduces the resource requirement. For example, the proposed system requires 12 times less memory than ResNet50 and produces decisions 2 times faster. With its significantly lower memory requirement and faster processing, the proposed system can be directly embedded into resource-limited devices such as portable fundus cameras.

the optic cup to that of the disc, is among the most commonly adopted screening tests for glaucoma 11. Figure 1 shows an example of a healthy (Fig. 1a) and a glaucomatous (Fig. 1b) optic nerve head (ONH).
Manual assessment of the optic nerve head is a time-consuming, tedious and subjective task, which necessitates the development of automated methods. In recent years there has been increasing interest in developing deep learning-based methods, in particular those built on convolutional neural networks (CNNs), for automated assessment of glaucoma 12. The main benefit of CNN-based methods is their ability to let the machine learn the inherent imaging features relevant to the disease by itself 13. CNN-based methods have been found to be more than 90% accurate in the detection of glaucoma, comparable to human experts 14,15. While these methods are highly accurate, the majority of them require manual cropping of the optic disc region and, importantly, almost all of them are highly computationally intensive, which makes it harder to integrate them into portable fundus cameras.
In contrast to other works in this context, in this study we aim to develop a fully automated, computationally lightweight CNN-based system that could run effectively on resource-limited devices such as the imaging device (i.e., the camera) itself and/or smartphones.
Specific contributions of this study include:
1. Development of a fully automated glaucoma detection system for resource-limited devices.
2. A context-dependent simplified version of the YOLO Nano architecture, resulting in improved performance.
3. The finding that low-resource CNNs are at least as good as resource-intensive CNNs in glaucoma detection.

Related work
Automated glaucoma detection methods can be largely divided into two groups: (1) traditional rule-based machine learning methods, and (2) deep learning-based methods.

Traditional rule-based machine learning methods
Traditional rule-based methods rely on predefined (or human-crafted) features that are extracted from the image and fed into a machine learning classifier to generate a decision. The majority of traditional rule-based machine learning methods in glaucoma assessment focus on first segmenting the optic disc region 15–17, and then training a classifier based on key statistical and/or physical features of this region.
In this section we review some of the recently proposed and well-performing traditional rule-based machine learning methods. Septiarini et al. 18 put forward a multi-step technique for glaucoma detection. The images were first preprocessed to remove noise. Histogram equalization was then applied to improve the quality of the images. Finally, a K-NN classifier was applied to differentiate between glaucomatous and healthy images. They achieved an accuracy of 95.24% on a dataset of 84 fundus images. Maheshwari et al. 19 proposed an iterative approach for glaucoma detection. The images were first iteratively decomposed using variational mode decomposition (VMD). The ReliefF algorithm was employed for feature extraction, and an LS-SVM classifier was used to classify images as glaucomatous or healthy. They achieved an accuracy of 95.19%. Agrawal et al. 20 proposed a multi-stage framework for glaucoma detection that includes decomposition of the fundus images using quasi-bivariate variational mode decomposition (QB-VMD), followed by ReliefF-based feature extraction and, finally, SVM-based classification. They achieved a tenfold cross-validation accuracy of 86.13% on the RIM-ONE dataset. Kirar et al. 21 proposed a glaucoma detection model using DWT- and EWT-based feature extraction. A support vector machine (SVM) classifier was used to classify glaucomatous and healthy images. They achieved an accuracy of 83.57%, with sensitivity and specificity of 86.40% and 80.80%, respectively. Gour et al. 22 proposed a multi-step model to detect glaucoma using fundus images. The images were first preprocessed using the CLAHE histogram equalization technique. Feature selection was then performed using PCA. Finally, classification was performed by an SVM. On the HRF dataset, they achieved an accuracy of 79.20% and an AUC of 86%. Mookiah et al. 23 used high-level statistical measures based on Fourier methods as the primary features to be extracted for glaucoma assessment. An SVM was then trained on those features to classify glaucomatous against non-glaucomatous subjects. They achieved an accuracy of 95%, with sensitivity and specificity of 93.3% and 96.7%, respectively. Nayak et al. Sundaram et al. 24 developed a system for automated eye disease prediction, inclusive of glaucoma, using a bag of visual words and an SVM. Local features were extracted using Speeded Up Robust Features (SURF), which were then clustered to obtain a bag of features/visual words, and finally classification was performed using an SVM. A maximum classification accuracy of 92% was obtained.

Deep learning-based methods
Deep learning methods learn to extract complex features from the data by themselves, without the need for predefined, human-crafted features. The convolutional neural network (CNN), a deep learning architecture, progressively learns more complex and abstract visual features from the images with increasing network depth. Because of their superior performance over traditional rule-based machine learning methods, deep learning-based methods have become increasingly popular in recent years for biomedical image analysis, including glaucoma assessment.
In this section we review the recent CNN-based methods for glaucoma detection using CFPs. Li et al. 25 trained an InceptionV3 model from scratch using a dataset collected privately from various clinical settings in China. A total of 39,745 fundus images were used for training and validation of the model. While a total of 48,116 images were initially selected for this study, 7371 images were excluded because of poor quality or location. The ADAM optimiser with a learning rate of 2 × 10⁻³ was used, with a batch size of 32. Christopher et al. 26 independently trained three CNNs, namely VGG16, Inception and ResNet50. A total of 14,822 fundus images were used. The images were collected from two prospective longitudinal studies, The African Descent and Glaucoma Evaluation Study and The Diagnostic Innovations in Glaucoma Study. Both transfer learning and learning from scratch were tried while training the models. They found ResNet50 to be the best performing model for glaucoma detection, and transfer learning to be the best approach to train the model.
Raghavendra et al. 27 proposed a simple 18-layer CNN and applied it to glaucoma detection. The model consists of 4 convolution blocks and a fully connected layer, followed by a soft-max classifier. Each convolution block contains 1 convolutional layer, 1 batch normalization layer, 1 ReLU and 1 max-pooling layer. The stochastic gradient descent (SGD) optimiser was used, and the learning rate was trialed logarithmically from 10⁻¹ to 10⁻⁴. A total of 1426 fundus images collected from India were used in the experiment.
Al-Bander et al. 28 trained a DenseNet model using four publicly available datasets for detecting glaucoma. Pal et al. 29 developed a specific CNN-based framework for glaucoma detection, namely G-EyeNet. Liu et al. 30 developed a system relying upon ResNet. A series of preprocessing steps specific to glaucoma detection, including image down-sampling and generation of optic disc-centered images, was performed. SGD was used as the optimizer. A total of 269,601 fundus images collected from the Chinese Glaucoma Study Alliance were used for training and validation of the system. Diaz-Pinto et al. 14 experimented with 5 different CNN architectures, namely VGG16, ResNet, VGG19, Inception-V3, and Xception, to assess glaucoma. The images were cropped to the OD regions. They obtained the best performance using the Xception model. Commonly used pretrained models have also been applied in this context by several authors, for example Gómez-Valverde et al. 31 and Phan et al. 32.
Elangovan et al. 33 proposed an 18-layer CNN model to assess glaucoma. Publicly available datasets, namely RIM-ONE2, DRISHTI-GS1, ACRIMA, ORIGA and LAG, were used for the experiments. Sreng et al. 34 developed a multi-stage framework for glaucoma detection. Initially, a DeepLabV3+ architecture segmented the optic disc. Afterwards, pre-trained CNNs were applied for feature extraction, and an SVM was employed to classify glaucomatous and healthy images. Srinivasa et al. 35 presented an 8-layer model for the detection of glaucoma. The model consists of 8 layers, inclusive of 4 convolutional and 4 connected layers.
Gheisari et al. 36 developed a model that combines a CNN with a recurrent neural network for enhanced glaucoma detection. They found improved glaucoma detection performance when temporal features, extracted from video, are used in the detection along with the spatial features. They experimented with both VGG16 and ResNet models. Maheswari et al. 37 developed a technique for glaucoma diagnosis using deep learning and local descriptor-based augmentation. Chaudhary et al. 38 proposed a method, namely the two-dimensional Fourier-Bessel series expansion-based empirical wavelet transform (2D-FBSE-EWT), for boundary detection, and applied it to decompose fundus images into sub-images. Glaucoma detection from the sub-images was performed using two methods: (1) conventional machine learning, and (2) deep learning (ensemble ResNet-50).
Carvalho et al. 39 applied a three-dimensional convolutional neural network (3DCNN) to the diagnosis of glaucomatous and healthy fundus images. They developed a technique that converts two-dimensional fundus images to three-dimensional volumes to be used by the 3DCNN. Lin et al. 40 developed an algorithm named GlaucomaNet to classify glaucomatous and healthy eyes. GlaucomaNet consists of two convolutional neural networks, one performing preliminary grading and the other performing detailed grading. Fan et al. 41 comprehensively explored the generalizability and explainability of the Vision Transformer deep learning technique in detecting glaucoma using fundus photographs. The Data-efficient image Transformer (DeiT) and ResNet-50 models were compared. Shoukat et al. 42 trained a ResNet model for the diagnosis of glaucoma. Image pre-processing was performed to enhance image quality prior to the assessment. Velpula et al. 43 developed a multi-stage glaucoma classification model using pre-trained CNNs and voting-based classifier fusion. They utilized 5 pre-trained CNNs, namely ResNet-50, AlexNet, VGG16, DenseNet-201 and Inception-ResNet-v2.
Table 1 summarises the overall performance of these models.
A detailed breakdown of the datasets used in this study is provided in Table 2.
Excluding Drishti-GS and HRF, all datasets had image dimensions of less than 1000 × 1000 pixels. Among these datasets, only the RIM-ONE images were provided cropped to the ONH region. Since the ONH is the primary region of interest for glaucoma assessment 14,15, in line with other studies in the literature we manually cropped the images from the other datasets as well. Data augmentation, including rotation in the range of −5 to 5 degrees, shearing in the range of 0.2, and scaling in the range of 0.2, was performed for both 'glaucomatous' and 'non-glaucomatous' images.

Proposed system
A two-step approach for glaucoma detection using color fundus photographs is proposed. Given a fundus photograph as input, in the first step the ONH region is detected using a simplified You Only Look Once (YOLO) neural network architecture 50. In the second step the ONH region is classified as 'glaucomatous' or 'non-glaucomatous' using the MobileNetV3Small CNN 51. The overall workflow of the system is shown in Fig. 2.
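The two-step workflow can be sketched in a few lines of Python. This is only an illustrative outline, not the authors' implementation: the detector and classifier are passed in as plain callables, the stub models, the `Box` layout and the 0.5 decision threshold are all assumptions made for the example.

```python
from typing import Callable, List, Optional, Tuple

Image = List[List[int]]          # toy stand-in for a fundus photograph (rows of pixels)
Box = Tuple[int, int, int, int]  # assumed layout: (x_min, y_min, x_max, y_max), max exclusive


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by the bounding box."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def detect_glaucoma(
    image: Image,
    detect_onh: Callable[[Image], Optional[Box]],
    classify_onh: Callable[[Image], float],
    threshold: float = 0.5,      # assumed decision threshold
) -> Optional[str]:
    """Step 1: localize the ONH. Step 2: classify the cropped ONH region."""
    box = detect_onh(image)
    if box is None:              # no ONH found, so no decision can be made
        return None
    score = classify_onh(crop(image, box))
    return "glaucomatous" if score >= threshold else "non-glaucomatous"


# Hypothetical stubs so that the sketch runs without trained models.
def stub_detector(image: Image) -> Optional[Box]:
    return (1, 1, 3, 3)


def stub_classifier(region: Image) -> float:
    pixels = [p for row in region for p in row]
    return sum(pixels) / (255 * len(pixels))
```

In a real deployment the two callables would wrap the trained detection and classification networks; keeping them as parameters makes it easy to swap either model without touching the pipeline logic.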
Step 1: optic nerve head (ONH) region detection
The ONH is the primary region of interest for glaucoma assessment. The ONH area is only a fraction of the fundus image; thus, processing only the ONH region ensures optimal use of the CNN's learning capacity and improves the classification performance.
ONH region detection is performed using the YOLO architecture, which is extremely fast. In contrast to competing region detection architectures such as Faster R-CNN 52, YOLO utilizes features from the entire image to predict each bounding box. Furthermore, for a given image YOLO predicts all bounding boxes for all classes simultaneously. These design strategies enable YOLO to perform very fast object localization and to make comparatively fewer background mistakes than its competitors (e.g., the R-CNN family of algorithms) 50.
YOLO is a family of algorithms, and in this study we specifically focused on architectures that are designed for resource-limited devices such as mobile phones. In particular, we focused on YOLO Nano 53 and its variants, namely YOLO-v5 Nano 54 and YOLO-v7 Tiny 55, which are highly compact, very fast and memory-efficient algorithms. The YOLO Nano 53 architecture relies upon three principal concepts: (1) Residual Projection-Expansion-Projection Macroarchitecture, (2) Fully-connected Attention Macroarchitecture, and (3) Macroarchitecture and Microarchitecture Heterogeneity.
The Residual Projection-Expansion-Projection (PEP) macroarchitecture comprises: (i) a projection layer that projects output channels into a lower-dimensional output tensor, (ii) an expansion layer that expands the number of channels, (iii) a depth-wise convolution layer that performs spatial convolutions, and (iv) a projection layer that projects output channels into a lower-dimensional output tensor. 1 × 1 convolutions are applied in all the layers except the depth-wise convolution layer, where a 3 × 3 convolution is applied. The residual PEP macroarchitecture enables significant reductions in computational cost without compromising model expressiveness.
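To make the cost reduction concrete, the following sketch counts the weights of one PEP module and compares them with a single standard 3 × 3 convolution that maps the same number of input channels to the same number of output channels. The channel widths (128, 32, 128, 128) are illustrative assumptions, not values taken from YOLO Nano, and biases and normalization parameters are ignored.

```python
def conv_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weights of a standard convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def depthwise_params(channels: int, kernel: int) -> int:
    """Weights of a depth-wise convolution: one kernel per channel."""
    return kernel * kernel * channels


def pep_params(c_in: int, c_proj: int, c_exp: int, c_out: int) -> int:
    """Weights of a projection-expansion-projection module as described above."""
    return (
        conv_params(c_in, c_proj, 1)      # (i)   1x1 projection to fewer channels
        + conv_params(c_proj, c_exp, 1)   # (ii)  1x1 expansion
        + depthwise_params(c_exp, 3)      # (iii) 3x3 depth-wise spatial convolution
        + conv_params(c_exp, c_out, 1)    # (iv)  1x1 projection to the output width
    )


# Assumed, illustrative channel widths.
pep = pep_params(c_in=128, c_proj=32, c_exp=128, c_out=128)
standard = conv_params(c_in=128, c_out=128, kernel=3)
print(pep, standard, round(standard / pep, 2))
```

With these assumed widths the PEP module needs 25,728 weights against 147,456 for the standard convolution, roughly a 5.7-fold reduction. The exact saving depends on the chosen projection and expansion widths.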
The Fully-connected Attention (FCA) macroarchitecture comprises two fully-connected layers that learn the dynamic, non-linear inter-dependencies between channels. FCA produces modulation weights that are used to re-weight the channels. FCA facilitates dynamic feature recalibration and ensures better utilization of the available network capacity.
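A minimal pure-Python sketch of this channel re-weighting idea is given below. The text above specifies only the two fully-connected layers and the resulting modulation weights; the global average pooling used to summarize each channel, the ReLU between the layers and the sigmoid gate are assumptions borrowed from common channel-attention designs.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def fully_connected(x: Vector, weights: Matrix, bias: Vector) -> Vector:
    """Dense layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def fca(feature_map: Matrix, w1: Matrix, b1: Vector, w2: Matrix, b2: Vector) -> Matrix:
    """Re-weight channels; `feature_map` holds one flattened list of values per channel."""
    # Summarize each channel by its mean (assumed global average pooling).
    squeezed = [sum(channel) / len(channel) for channel in feature_map]
    # First fully-connected layer with an assumed ReLU non-linearity.
    hidden = [max(0.0, h) for h in fully_connected(squeezed, w1, b1)]
    # Second fully-connected layer; an assumed sigmoid maps outputs to (0, 1) weights.
    gates = [1.0 / (1.0 + math.exp(-z)) for z in fully_connected(hidden, w2, b2)]
    # Modulate every channel by its weight.
    return [[g * v for v in channel] for g, channel in zip(gates, feature_map)]
```

For example, with two channels, one hidden unit and an all-zero second layer, every gate equals 0.5 and each channel is simply halved; learned weights would instead emphasize some channels over others.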
The Macroarchitecture and Microarchitecture Heterogeneity in YOLO Nano architecture is presented through a diverse mix of PEP modules, EP modules, FCA, as well as individual 3 × 3 and 1 × 1 convolution layers, and in terms of each module or layer having unique microarchitectures.Heterogeneity in the YOLO Nano architecture helps to achieve a very strong balance between architectural and computational complexity and model expressiveness.
Simplified YOLO Nano. The original YOLO Nano 53 follows the multi-scale prediction strategy of YOLO v3; in other words, it relies upon feature maps at multiple scales (13 × 13, 26 × 26, and 52 × 52 for an input size of 416 × 416 pixels) to detect objects of different sizes. Among these feature maps, the 52 × 52 feature map, which has the smallest receptive field, targets tiny objects 56. In a color fundus image the ONH is not tiny; hence, the 52 × 52 scale of the original YOLO Nano can, in principle, be removed to speed up the calculation without compromising the performance, and that is what has been done in this work.
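The effect of dropping the finest scale on the number of candidate predictions can be illustrated with a short calculation. The strides of 32, 16 and 8 follow from the grid sizes quoted above for a 416 × 416 input; the figure of three anchor boxes per grid cell is the YOLO v3 convention and is assumed here rather than taken from this study.

```python
from typing import List, Sequence


def grid_sizes(input_size: int, strides: Sequence[int]) -> List[int]:
    """Side length of the prediction grid at each detection scale."""
    return [input_size // stride for stride in strides]


def num_predictions(grids: Sequence[int], anchors_per_cell: int = 3) -> int:
    """Total candidate boxes, assuming a fixed number of anchors per grid cell."""
    return sum(g * g * anchors_per_cell for g in grids)


original = grid_sizes(416, (32, 16, 8))   # three scales: 13, 26 and 52
simplified = grid_sizes(416, (32, 16))    # finest (52 x 52) scale removed

print(original, num_predictions(original))
print(simplified, num_predictions(simplified))
```

Under these assumptions the candidate boxes drop from 10,647 to 2,535. The count of candidate boxes is not a direct measure of runtime, since most of the computation lies in the shared backbone, so this figure should not be read as the expected speed-up.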
Step 2: classification of 'glaucomatous' and 'non-glaucomatous'
Classification of the ONH region (i.e., 'glaucomatous' or 'non-glaucomatous') is performed using an independently trained MobileNetV3Small model. Extensive experiments were conducted to determine the optimal CNN in this context. Seven state-of-the-art CNNs with varying computational cost and memory requirements, namely ResNet50, InceptionV3, InceptionResNetV2, MobileNetV2, MobileNetV3Small, 18-Layer CNN and Custom ResNet, were evaluated. The CNNs were trained using both the "transfer learning" and the "learning from scratch" approaches 57.
(1) Transfer learning: Available pre-trained ImageNet weights were used to initialize the internal weights of all the models. Random initialization was used for the added top layers of the models. Each of the models was trained in 2 consecutive steps. First, a partial training was performed, in which all the layers of the model except the newly added layers were frozen, and training was performed for 5 epochs using the RMSProp optimiser and a learning rate of 10⁻⁴. Afterwards, a full training of the model was performed for 200 epochs for each of the models under consideration.
The five of the seven CNNs for which pretrained weights were available were used in this experiment. For each of the models, a number of fine-tuning strategies were tried during training. More specifically, we experimented with fine-tuning the top 0-100% of the layers, in steps of 10%. Both the ADAM and RMSProp optimisers were tried, and different learning rates were also tested.
(2) Learning from scratch: All 7 models were used in this experiment. For each of the models, random weights were used to initialize the model parameters, and a full training was performed, meaning that all layers of the model were trained from the data. The Glorot uniform function was used for initialization in all the models. We experimented with different learning rates and optimizers, namely ADAM and RMSProp. Training was performed for a total of 200 epochs for each model.
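The fine-tuning search described for transfer learning can be written down as a small grid. The sketch below is illustrative only: the function names are hypothetical, "top" is interpreted as the layers closest to the output, and the three learning rates are assumed sample points, since the text states only that different learning rates were tested.

```python
from itertools import product
from typing import Iterator, List, Tuple


def trainable_mask(num_layers: int, top_percent: int) -> List[bool]:
    """True for layers to fine-tune: the top `top_percent` percent, counted from the output."""
    n_trainable = num_layers * top_percent // 100
    return [index >= num_layers - n_trainable for index in range(num_layers)]


def search_grid() -> Iterator[Tuple[int, str, float]]:
    """Every (top_percent, optimiser, learning_rate) combination in the assumed grid."""
    percents = range(0, 101, 10)            # 0%, 10%, ..., 100% of the layers
    optimisers = ("adam", "rmsprop")
    learning_rates = (1e-5, 1e-4, 1e-3)     # assumed sample points
    return product(percents, optimisers, learning_rates)


print(trainable_mask(10, 30))
print(sum(1 for _ in search_grid()))
```

In a framework such as Keras, a mask like this would typically be applied by setting each layer's `trainable` attribute before compiling the model.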

Performance measurement
To evaluate the performance of the ONH region detection task, the intersection over union (IoU) score between the ground truth rectangle and the rectangle(s) returned by the system (i.e., Simplified YOLO Nano) was used. The IoU score is mathematically defined as:
\[ \mathrm{IoU} = \frac{|G \cap O|}{|G \cup O|} \]
where G is the binary image containing the original rectangle (with inside filled), O is the binary image containing the rectangle(s) (with inside filled) returned by the system, and |·| denotes the number of foreground pixels.
The Dice coefficient, defined mathematically as below, was also used to evaluate the performance of the models:
\[ \mathrm{Dice} = \frac{2\,|G \cap O|}{|G| + |O|} \]
where G and O are defined as above.
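As a worked illustration, both scores can be computed directly on filled rectangles. The sketch below represents G and O as sets of pixel coordinates, which handles the case of several returned rectangles naturally; the box layout (x_min, y_min, x_max, y_max, with exclusive maxima) and the example coordinates are assumptions for the example.

```python
from typing import Iterable, Set, Tuple

Box = Tuple[int, int, int, int]   # assumed layout: (x_min, y_min, x_max, y_max), max exclusive
Pixels = Set[Tuple[int, int]]


def filled_mask(boxes: Iterable[Box]) -> Pixels:
    """Set of pixels covered by one or more filled rectangles."""
    pixels: Pixels = set()
    for x0, y0, x1, y1 in boxes:
        pixels.update((x, y) for x in range(x0, x1) for y in range(y0, y1))
    return pixels


def iou(g: Pixels, o: Pixels) -> float:
    """Intersection over union of two pixel sets."""
    union = g | o
    return len(g & o) / len(union) if union else 0.0


def dice(g: Pixels, o: Pixels) -> float:
    """Dice coefficient of two pixel sets."""
    total = len(g) + len(o)
    return 2 * len(g & o) / total if total else 0.0


ground_truth = filled_mask([(0, 0, 10, 10)])   # 100 pixels
detected = filled_mask([(5, 0, 15, 10)])       # 100 pixels, half overlapping
print(iou(ground_truth, detected), dice(ground_truth, detected))
```

Here the overlap is 50 pixels and the union 150 pixels, giving an IoU of 1/3 and a Dice coefficient of 0.5. The Dice coefficient is never lower than the IoU for the same pair of regions.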
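For the classification task, sensitivity, specificity, accuracy, area under the curve (AUC) and F1 score were used. With TP, TN, FP and FN denoting the numbers of true positives, true negatives, false positives and false negatives, respectively, these metrics are defined mathematically as below.
\[ \mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad \mathrm{Specificity} = \frac{TN}{TN + FP}, \qquad \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad F1 = \frac{2\,TP}{2\,TP + FP + FN} \]
AUC is the area under the receiver operating characteristic (ROC) curve.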
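The classification metrics can be computed from the confusion-matrix counts as in the following sketch, which uses the standard textbook definitions. The counts and scores in the example are made up purely for illustration, and the AUC is computed through its rank interpretation (the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one), which is one of several equivalent ways to obtain it.

```python
from typing import Dict, Sequence


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard metrics from true/false positive/negative counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
    }


def auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise ranking; ties count as half."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))


# Made-up counts and scores, for illustration only.
print(classification_metrics(tp=90, fp=20, tn=80, fn=10))
print(auc([0.9, 0.8, 0.4], [0.5, 0.3]))
```

The pairwise AUC above is quadratic in the number of samples, which is fine for a sketch; library implementations sort the scores instead.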

ONH region detection
The results show that YOLO-v7 Tiny marginally outperforms the other models in terms of Dice coefficient and IoU score; however, the model consumes 3 times more memory than the others. Simplified YOLO Nano, YOLO Nano, and YOLO-v5 Nano all have very similar performance, with simplified YOLO Nano marginally outperforming YOLO Nano. Simplified YOLO Nano also requires about 3% less computation time than YOLO Nano.

Classification of 'glaucomatous' and 'non-glaucomatous'
Table 4 summarizes the performance of all the models for the "transfer learning" approach, and Table 5 summarizes the performance for the "learning from scratch" approach.
From Table 4, it is apparent that MobileNetV3Small overall outperforms all other models considered. In comparison to the other models, MobileNetV3Small shows higher sensitivity, accuracy, AUC and F1 score, with comparable specificity. ResNet50 and InceptionResNetV2 are the next two models in terms of overall performance. ResNet50 shows higher sensitivity and F1 score than InceptionResNetV2; however, InceptionResNetV2 shows higher specificity and AUC. Both produce identical accuracy. The differences in performance among these three models (MobileNetV3Small, ResNet50 and InceptionResNetV2) are not large enough to be relevant in practice.
From Table 5, it is apparent that MobileNetV3Small once again outperforms the other models overall, with higher specificity, accuracy, AUC and F1 score. InceptionResNetV2, InceptionV3 and ResNet50 show marginally higher sensitivity than MobileNetV3Small. Following MobileNetV3Small, InceptionResNetV2, InceptionV3 and ResNet50 are the next three models in terms of overall performance.
Unsurprisingly, "transfer learning" yields higher performance than "learning from scratch". Overall, increases of 0.8% in sensitivity, 2.24% in specificity, 1.82% in accuracy, 0.9% in AUC and 1.72% in F1 score over the "learning from scratch" approach are observed. Table 6 summarizes the memory requirements and the time to produce a decision for the different CNN models considered.
From Table 6, it is apparent that MobileNetV3Small has the lowest memory requirement, followed by MobileNetV2, 18-Layer CNN, and Custom ResNet. Custom ResNet requires the least time to produce a decision, followed by MobileNetV2 and MobileNetV3Small. ResNet50 requires 12 times more memory than MobileNetV3Small, and also takes twice as long as MobileNetV3Small to produce a decision. The time and memory requirements of InceptionResNetV2 are, respectively, about 1.7 times and 2.2 times higher than those of ResNet50. InceptionV3 requires slightly less memory and time than ResNet50.
Training and validation were performed on a Dell Precision 5820 Tower Workstation (Dell Inc., Round Rock, TX, USA). The workstation had an Intel Xeon 3.60 GHz CPU, 64 GB of RAM, and an NVIDIA GeForce RTX 2080Ti GPU installed.
With the exception of Custom ResNet and 18-Layer CNN, all CNNs were readily available in Keras and were used without alteration. Custom ResNet and 18-Layer CNN were implemented following the descriptions and guidelines of their authors. The ADAM and RMSProp optimisers were independently tried for all of the models under consideration. We also experimented with different learning rates in the range 10⁻⁵ to 10⁻³. The results reported in the tables are those of the optimal setup.
We also experimented with detecting glaucoma without performing the ONH region detection. The aim of this experiment was to justify the need for region detection as a prior step for glaucoma detection. Without ONH region detection, the performance of all the models deteriorates. For computationally intensive models (e.g., Inception-v3, ResNet50, InceptionResNet-v2), an overall accuracy decrease of 2% was observed. For computationally lightweight models, the accuracy drop was about 3%. Table 7 shows the findings. Only the CNNs for which pretrained weights were available were used in this experiment. Also, the RIM-ONE dataset was excluded from this experiment, since only cropped images were available.
For ONH region detection, we experimented with YOLO Nano, YOLO-v5 Nano, and YOLO-v7 Tiny. We also proposed a context-aware simplification strategy and applied it to the original YOLO Nano. A similar context-aware simplification strategy could also be adopted for YOLO-v5 Nano and YOLO-v7 Tiny. However, we did not find any readily available implementation of YOLO-v5 Nano or YOLO-v7 Tiny that could be customized as needed.
In this study, a total of 6671 images were available for training and validation. In order to better investigate the performance and overall acceptability of the model, and in line with the literature 58, a tenfold cross validation was performed. The images had varying dimensions. However, we did not experimentally observe any difference in performance attributable to this that would be relevant in practice.
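A tenfold split of the 6671 images can be sketched as follows. The text does not state how the folds were formed, so the shuffling, the fixed seed and the absence of stratification by class or by dataset are all assumptions of this example.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_samples: int, k: int = 10, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # assumed: shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]     # k nearly equal-sized folds
    for held_out in range(k):
        validation = folds[held_out]
        train = [
            index
            for fold_number, fold in enumerate(folds)
            if fold_number != held_out
            for index in fold
        ]
        yield train, validation


for train, validation in k_fold_indices(6671, k=10):
    print(len(train), len(validation))
```

Each image is used for validation exactly once and for training in the other nine folds; the reported metrics would then be averaged over the ten validation folds, as noted in the captions of Tables 4, 5 and 7.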

Conclusions
Recent studies using CNN-based models for automated glaucoma detection have achieved performance comparable to human experts 59. These studies limited their focus to performance, without considering the computational resources and/or time required to produce a decision. In this study we propose a fast and fully automated system for glaucoma detection that does not compromise performance. The effectiveness of different CNN architectures was experimentally evaluated using publicly available datasets. The proposed system, which relies upon the simplified YOLO net (also proposed in this study) and MobileNetV3Small, achieves an accuracy of 97.4%, with sensitivity and specificity of 97.5% and 97.2%, respectively. The performance of the proposed system is at least as good as that of the highly computationally intensive state-of-the-art CNNs evaluated in this study and reported in the literature (Table 1). The best reported method in the literature produced an accuracy, sensitivity and specificity of 99%, 99% and 99%, respectively. However, that study was conducted on a very small dataset of 705 images. The other methods reported in the literature are overall less accurate than the proposed system. In our experiments, the observed accuracy, sensitivity and specificity for ResNet50 are 97.2%, 97.3% and 97.1%, respectively. For InceptionResNetV2 these values are 97.2%, 96.3%, and 97.5%, respectively. With comparable or better performance, the proposed system produces significantly faster decisions and drastically reduces the resource requirements. For example, the proposed system requires 12 times less memory than ResNet50 and produces decisions 2 times faster.
The extensive experiments in this study also contribute some new findings to the literature:
1. Computationally lightweight CNNs, specifically MobileNetV2 and MobileNetV3Small, are capable of detecting glaucoma in color fundus photographs with performance comparable to that of computationally intensive CNNs.
2. Glaucoma can be detected more accurately by the CNNs when only the ONH region, rather than the whole image, is fed to the models.
In the future, we would like to evaluate the performance of the system on resource-limited hardware.

Figure 1. Example representation of (a) a healthy and (b) a glaucomatous optic nerve head. The inner, smaller circle is the optic cup and the larger circle is the optic disc.

Figure 3 shows example ONH detection by the YOLO Nano and simplified YOLO Nano (proposed) models. Table 3 quantitatively summarizes the performance of YOLO Nano, simplified YOLO Nano, YOLO-v5 Nano, and YOLO-v7 Tiny using the Dice coefficient and IoU score.

Figure 3. Example ONH bounding boxes determined by YOLO Nano and simplified YOLO Nano on 2 different example images. The ground truth bounding box is shown in "green"; the bounding boxes determined by YOLO Nano and simplified YOLO Nano are shown in "red" and "blue", respectively.

Table 1. Sensitivity, specificity, accuracy, and area under the curve (AUC) of different CNN-based methods in glaucoma detection. a Best performance when only spatial features are used. b Overall. c Overall, for binary classification.

Table 2. Breakdown of the images available in each dataset. To overcome the class imbalance between the 'glaucomatous' and 'non-glaucomatous' images, a random subset of 1577 glaucomatous images was selected, horizontally flipped and saved as new images.

Table 3. Performance evaluation of YOLO Nano and simplified YOLO Nano.

Table 4. Performance evaluation of different models when transfer learning was used to train them. The values presented in this table are the averages over the tenfold cross validation.

Table 5. Performance evaluation of different models for the "learning from scratch" approach. The values presented in this table are the averages over the tenfold cross validation.

Table 6. Memory requirements and time to produce a decision for the CNNs.

Table 7. Performance evaluation of different CNN models when glaucoma detection was performed on the whole images. The values presented in this table are the averages over the tenfold cross validation.